Experimental evidence of effective human–AI collaboration in medical decision-making

Artificial Intelligence (AI) systems are a valuable support for decision-making, with many applications also in the medical domain. The interaction between MDs and AI enjoys a renewed interest following the increased possibilities of deep learning devices. However, we still have limited evidence-based knowledge of the context, design, and psychological mechanisms that craft an optimal human–AI collaboration. In this multicentric study, 21 endoscopists reviewed 504 videos of lesions prospectively acquired from real colonoscopies. They were asked to provide an optical diagnosis with and without the assistance of an AI support system. Endoscopists were influenced by AI (OR = 3.05), but not erratically: they followed the AI advice more when it was correct (OR = 3.48) than when it was incorrect (OR = 1.85). Endoscopists achieved this outcome through a weighted integration of their own and the AI's opinions, considering the case-by-case estimations of the two reliabilities. This Bayesian-like rational behavior allowed the human–AI hybrid team to outperform both agents taken alone. We discuss the features of the human–AI interaction that determined this favorable outcome.


A.1 Experimental stimuli
Lesions used in the present study were acquired in a clinical study named CHANGE (Characterization Helping the Assessment of Colorectal Neoplasia in Gastrointestinal Endoscopy, ClinicalTrials.gov registration NCT04884581). The CHANGE study is a single-center, single-arm, prospective study investigating the use of the GI Genius (GIG) CADx device in the real-time characterization of colorectal polyps (i.e., prediction of their histology during the colonoscopy). Enrolled patients underwent a standard white-light colonoscopy with the support of the GIG CADx device v3.0 (manufactured by Cosmo AI/Linkverse and distributed by Medtronic). All the colonoscopy videos considered in the study were acquired in full length with unaltered quality, bearing no trace of the AI used (no overlay). Patients' clinical data and polyp histopathological information were saved in an electronic Case Report Form. The localization of each polyp in each patient was carefully annotated by scientific annotation experts. This annotation was cross-checked against the data in the Case Report Form for the same patient to avoid any possibility of erroneous correspondence between the polyp in the video and the related histology. Polyps for which video recording failed or for which no histology could be obtained were excluded. For each polyp, a short video clip was prepared, starting a few seconds before the first polyp appearance and ending with endoscopic polyp resection. If multiple polyps were present in the same video section, a separate clip was generated for each individual polyp. For each lesion, two different video clip versions were prepared. In the first version, to be used in S1, the clip had no markings except for a green box surrounding the target lesion for the first ten seconds after the lesion appearance. In the second version, to be used in S2, the video clip was processed with the GIG algorithms, and the output was identical to that observed during real-life utilization of the GIG medical device. The labels analyzing/adenoma/non-adenoma/no-prediction were visible on the lesion.

A.1.1 Example video clips
We provide in the supplementary online materials three video clips from the experimental stimuli set. The output of the AI visible in the video clips is the original generated by the CADx, and it was shown in Session S2. The same lesion was previously shown in Session S1 without the label of the CADx, i.e., only with the CADe green box.

Figure caption: Frame of a video clip of a colonoscopy showing a lesion automatically labeled by the AI system as "non-adenoma". The full video clip is available in the supplementary material online as V2_example_nonadenoma.mp4.

A.1.2 GI Genius CADx device
The CADx+CADe device used in this study is GI Genius v3.0 (developed by Cosmo AI, Ireland, and distributed by Medtronic, US). GI Genius v3.0 received CE clearance under the European Medical Device Directive (MDD, 93/42/EEC) in 2021 as a class IIa medical device. The CADx is designed to automatically activate when a new polyp is detected by a CADe algorithm in a colonoscopy video stream. For each polyp, a CADx algorithm overlays a frame-by-frame live decision specifying its binary histology ("adenoma" or "non-adenoma"). The CADx can also abstain from predicting the polyp histology in a frame either by printing "no-prediction" if the system is unsure about the histology or "analyzing" if an insufficient number of features across multiple frames was detected. The CADx algorithm training details and its performances are fully reported by the manufacturer (Linkverse, Cosmo AI) in the documentation accompanying the device. Briefly, the CADx neural network was trained using data extracted from the study "The Safety and Efficacy of Methylene Blue MMX Modified Release Tablets Administered to Subjects Undergoing Screening or Surveillance Colonoscopy" (ClinicalTrials.gov NCT01694966), a multinational, multicenter study that enrolled over one thousand patients. The study recorded lossless, high definition, full procedure colonoscopy videos and complete information on polyp characteristics and histology. The histopathological evaluation was based on the revised Vienna classification of gastrointestinal epithelial neoplasia. The polyps corresponding to Vienna category 1 (negative for neoplasia) or 2 (indefinite for neoplasia) were considered "non-adenoma". The polyps corresponding to Vienna category 3 (low-grade mucosal neoplasia), 4 (high-grade mucosal neoplasia), or 5 (submucosal invasion of neoplasia), were considered "adenoma". The video dataset thus obtained was further split into training, validation, and test subsets. 
Frames containing a polyp were manually annotated by trained personnel. Overall, for training, we collected 345 patients, 957 polyps, and 63445 images; for validation, we had 44 patients, 133 polyps, and 8645 images; for test, 165 patients, 405 polyps, and 26412 images.

A.2.1 Overview
After protocol registration, we sent invitations to expert and non-expert reviewers. Experts were endoscopists with more than 5 years of colonoscopy experience, a proven track record of scientific publications on optical characterization or a similar subject, and experience in optical biopsy with virtual chromoendoscopy. An endoscopist was considered "non-expert" if s/he had performed fewer than 500 colonoscopies. We sent successive waves of 20 invitations until we surpassed the minimum target number of participants (16 participants: 8 experts and 8 non-experts). We sent 40 invitations; 25 endoscopists accepted the invitation, and 21 completed the review of all lesions within the deadline. The average time needed to complete the survey was 55 days.

A.2.2 Procedure
Each batch of 84 lesions (B01, B02, . . . , B06) was administered as S1 and as S2. At least two weeks passed between the conclusion of the evaluation of one batch in S1 and the evaluation of the same batch in S2. The sequence of administration was the following: For each batch, 21 pre-randomization lists were prepared for the order of presentation of lesions. Each reviewer was preassigned a pre-randomization ID, and the lesions in each batch were presented to the reviewer in the order corresponding to the pre-randomization ID.
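The pre-randomization scheme can be illustrated with a minimal sketch. It assumes that each list is an independent uniform shuffle of the 84 lesions in a batch; the text does not state how the lists were generated, and the function name, seeds, and lesion indexing below are hypothetical.

```python
import random

def make_prerandomization_lists(n_lesions=84, n_lists=21, seed=0):
    """Return {list_id: presentation order} with one random order per list.

    Assumption: every list is an independent uniform permutation of the
    lesion indices 1..n_lesions (not stated in the source).
    """
    rng = random.Random(seed)
    lists = {}
    for list_id in range(1, n_lists + 1):
        order = list(range(1, n_lesions + 1))
        rng.shuffle(order)
        lists[list_id] = order
    return lists

# One set of 21 lists per batch; the seeds are arbitrary illustrative values.
batches = ["B01", "B02", "B03", "B04", "B05", "B06"]
schedule = {b: make_prerandomization_lists(seed=i) for i, b in enumerate(batches)}

# A reviewer with pre-randomization ID 7 sees batch B02 in this order:
order_for_reviewer = schedule["B02"][7]
```

Fixing the seed makes the lists reproducible, which is what allows them to be prepared before reviewers are enrolled.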

A.2.3 Questions of the online survey
Data were collected through an online survey. Each participant was provided with a unique username. The video interface allowed the user to fully control what to play, allowing the endoscopist to skip, view again, and pause. We recorded the viewing time for each frame. The participants answered the following questions.
indicating whether the $i$th lesion is correctly diagnosed ($D_{ij,\mathrm{AI}} = 1$) or not ($D_{ij,\mathrm{AI}} = 0$) by the AI. This quantity is referred to as the AI (perceived) correct diagnosis. Finally, the discrete numerical variable $S_{ijk} \in \{1, \ldots, 9\}$ measures the belief of the $j$th endoscopist about the $i$th lesion during the $k$th session. For example, the score $S_{ijk} = 9$ indicates a strong belief of the $j$th endoscopist in session $k$ that the $i$th lesion is an adenoma. Conversely, $S_{ijk} = 1$ denotes a strong belief of the $j$th endoscopist in session $k$ that the $i$th lesion is not an adenoma. The confidence score is defined as a combination of the human judgment variables and the associated human confidence variables. Hence, we let $S_{ijk} = 1, 2, 3, 4$ when the associated judgment is "Non-Adenoma" and the confidence ranges from "Very high" to "Very low". Similarly, we let $S_{ijk} = 6, 7, 8, 9$ when the associated judgment is "Adenoma" and the confidence ranges from "Very low" to "Very high". We set $S_{ijk} = 5$ when the judgment, and hence the confidence, is "Uncertain". We call these quantities confidence scores for S1 and S2.
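The mapping from judgment and confidence to the 1 to 9 score can be written out as a short sketch. The text names only the extreme confidence levels; the two intermediate labels ("High", "Low") and the function name are assumptions made for illustration.

```python
# Confidence levels ordered from strongest to weakest. The two middle labels
# are assumed: the text only names "Very high" and "Very low".
LEVELS = ["Very high", "High", "Low", "Very low"]

def confidence_score(judgment, confidence=None):
    """Combine a diagnosis and its confidence into a score in 1..9."""
    if judgment == "Uncertain":
        return 5
    rank = LEVELS.index(confidence)      # 0 = "Very high" ... 3 = "Very low"
    if judgment == "Non-Adenoma":
        return 1 + rank                  # 1 (very high) ... 4 (very low)
    if judgment == "Adenoma":
        return 9 - rank                  # 9 (very high) ... 6 (very low)
    raise ValueError(f"unknown judgment: {judgment!r}")
```

With this encoding, a higher score always means a stronger belief in "Adenoma", so differences between S1 and S2 scores have a consistent direction.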

A.4.1 Definition of the primary inferential quantities
We were interested in comparing the probability of a certain event during session 1 with the probability of the same event during session 2, possibly considering a subset of the observations. We made use of odds ratios, and we define the function $\mathrm{odd}(\pi) = \pi/(1 - \pi)$ for any $\pi \in (0, 1)$. For example, the main hypothesis (2.) in Section 2.3 aims at comparing the probabilities of a correct diagnosis between S1 and S2. We define the odds ratio for $j = 1, \ldots, J$ and $i = 1, \ldots, n$. This odds ratio captures the magnitude of the AI's effect on diagnosis. If its value is greater than (less than) 1, the AI's assistance is associated with an increased (decreased) probability of a correct diagnosis. Hence, this quantity is the main measure of interest for addressing the main hypothesis (2.) described in Section 2.3.
The second important aspect of the analysis is the estimation and quantification of the effectiveness and safety measures, which represent the AI effect on the diagnostic accuracy conditionally on the AI being correct (effectiveness) or incorrect (safety). These indices correspond to the main hypotheses (3.) and (4.) in Section 2.3. We defined the effectiveness as the following odds ratio for any $i \in E_j$ and $j = 1, \ldots, J$, where we define the index set $E_j = \{i = 1, \ldots, n : D_{ij,\mathrm{AI}} = 1\}$. Similarly, for the safety we let for any $i \in S_j$ and for $j = 1, \ldots, J$, where we define the index set $S_j = \{i = 1, \ldots, n : D_{ij,\mathrm{AI}} = 0\}$. The last important aspect of our analysis was the influence of the AI, irrespective of its correctness, as described in hypothesis (1.) in Section 2.3. Once again we relied on odds ratios and we defined: for $i \in A_j = \{1, \ldots, n : A_{ij} \notin \{\text{"Uncertain"}, \text{"I did not notice"}\}\}$ and for $j = 1, \ldots, J$, representing a measure of the shift of opinion of the endoscopist towards the perceived AI response.
The measures presented in this section (AI effect on accuracy, effectiveness, safety, AI influence) depend neither on the endoscopist nor on the lesion. This important assumption is a consequence of the logistic regression specification discussed in A.4.2 and crucially facilitates the interpretation of these indices.

A.4.2 Logistic regression modeling
We present the statistical modeling in an abstract fashion, as all the main hypotheses described in Section 2.3 (influence, diagnostic accuracy, effectiveness and safety) can be tested by relying on the same statistical tools. The quantities described in the previous section involve independent binary indicators $y_{ijk} \in \{0, 1\}$, associated with the $i$th lesion, the $j$th endoscopist, and the $k$th session. Moreover, recall that we are interested in estimating odds ratios of the form $\mathrm{odd}\{\mathrm{pr}(y_{ij2} = 1)\}/\mathrm{odd}\{\mathrm{pr}(y_{ij1} = 1)\}$. Let us define the following probabilities

where we consider a subject-specific generic subset $I_j \subseteq \{1, \ldots, n\}$. Then, we assume the following logistic regression specification, for $j = 1, \ldots, J$. We regarded $\beta$ as an unknown parameter, whereas the lesion-specific $\mu_i$ and subject-specific $\alpha_j$ values were modelled as Gaussian random effects for $i \in I_j$ and $j = 1, \ldots, J$. This specification leads to a straightforward interpretation of $\beta$ in terms of odds ratios. In fact, we can express the key parameter as follows for any $i \in I_j$ and $j = 1, \ldots, J$. We use the following notation for these key parameters: $\omega_I$ (influence, equation (6)), $\omega_A$ (accuracy, equation (7)), $\omega_E$ (effectiveness, equation (8)), and $\omega_S$ (safety, equation (9)).
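The displayed equations of this section are not reproduced in the text above, so the following is only a hedged reconstruction from the surrounding definitions. The symbol $\pi_{ijk}$, the session indicator $\mathbb{1}(k = 2)$, and the placement of the overall intercept $\mu$ (mentioned in Section A.5) are assumptions; only the roles of $\beta$, $\mu_i$ and $\alpha_j$ are stated in the text.

```latex
\begin{align*}
\pi_{ijk} &= \mathrm{pr}(y_{ijk} = 1),
  \qquad i \in I_j,\; j = 1,\dots,J,\; k = 1,2, \\
\log\frac{\pi_{ijk}}{1-\pi_{ijk}}
  &= \mu + \mu_i + \alpha_j + \beta\,\mathbb{1}(k = 2), \\
\mu_i &\sim N(0,\sigma^2_\mu), \qquad \alpha_j \sim N(0,\sigma^2_\alpha), \\
\frac{\mathrm{odd}(\pi_{ij2})}{\mathrm{odd}(\pi_{ij1})}
  &= \frac{\exp(\mu + \mu_i + \alpha_j + \beta)}{\exp(\mu + \mu_i + \alpha_j)}
   = e^{\beta} = \omega .
\end{align*}
```

Under a specification of this form, the lesion and subject effects cancel in the ratio, which is why the resulting odds ratio depends neither on the lesion nor on the endoscopist.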

A.4.3 Estimation and testing procedure
The models we described are logistic regression models with (Gaussian) random effects. The parameter estimates were obtained via (integrated) maximum likelihood using the lme4 R package [1]. Related inferential quantities (e.g., confidence intervals, hypothesis tests) were computed using the lme4 and/or exact2x2 R packages [2].

A.4.4 Inferential goals
The primary inferential goals were the main hypotheses (1.), (2.), (3.), and (4.) discussed in Section 2.3:
1. Influence of the AI. The opinion of the endoscopist is influenced by the response of the AI. This amounts to testing the null hypothesis $H_0 : \omega_I \le 1$ against the alternative $H_1 : \omega_I > 1$, where $\omega_I$ represents the AI influence as defined in equation (6).
2. Diagnostic accuracy. If the interaction between endoscopists and AI is well-calibrated, the diagnostic accuracy should improve. This amounts to testing the null hypothesis $H_0 : \omega_A \le 1$ against the alternative $H_1 : \omega_A > 1$, where $\omega_A$ represents the AI effect on the diagnostic accuracy defined in equation (7).
3. Effectiveness. The endoscopists rightfully accept the AI opinion when it is correct. This amounts to testing the null hypothesis $H_0 : \omega_E \le 1$ against the alternative $H_1 : \omega_E > 1$, where $\omega_E$ represents the effectiveness defined in equation (8).
4. Safety. The endoscopists are able to disengage from a wrong opinion of the AI. This amounts to testing the null hypothesis $H_0 : \omega_S \le \Delta_S$ against the alternative $H_1 : \omega_S > \Delta_S$, where $\omega_S$ represents the safety defined in equation (9) and $\Delta_S$ is the safety limit. We let $\Delta_S = 0.3$.
The aforementioned null hypotheses were tested sequentially at level $\alpha$ with a fixed-sequence multiple testing procedure [3]. If a certain hypothesis is rejected at the $\alpha$ level, the next hypothesis in the sequence is tested at the same level. Otherwise, testing stops, and subsequent hypotheses are not tested. The fixed-sequence procedure controls the family-wise error rate at level $\alpha$; namely, the probability of at least one incorrect conclusion (type I error) is bounded by $\alpha$. For the inference on a generic odds ratio $\omega$, we calculated the p-value for $H_0 : \omega \le 1$ ($H_0 : \omega \le \Delta$) and the point estimate $\hat{\omega}$ along with the $100(1 - \alpha)\%$ confidence lower bound $\omega_\alpha$. If we observe a p-value less than $\alpha$ and/or $\omega_\alpha > 1$ ($\omega_\alpha > \Delta$), then $H_0 : \omega \le 1$ ($H_0 : \omega \le \Delta$) will be rejected at level $\alpha$. The significance level $\alpha$ is set to 5%.
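The fixed-sequence logic described above can be sketched in a few lines. The function name and the example p-values are hypothetical and serve only to show the stopping rule; they are not results from the study.

```python
def fixed_sequence_test(ordered_p_values, alpha=0.05):
    """Apply a fixed-sequence procedure to (name, p-value) pairs.

    Hypotheses are tested in the prespecified order, each at level alpha.
    Testing stops at the first non-rejection; later hypotheses are not tested.
    """
    decisions = {}
    stopped = False
    for name, p in ordered_p_values:
        if stopped:
            decisions[name] = "not tested"
        elif p < alpha:
            decisions[name] = "rejected"
        else:
            decisions[name] = "not rejected"
            stopped = True
    return decisions

# Hypothetical p-values, for illustration only.
example = [
    ("influence", 0.001),
    ("accuracy", 0.020),
    ("effectiveness", 0.200),
    ("safety", 0.001),
]
result = fixed_sequence_test(example)
```

In this made-up example the safety hypothesis is left untested even though its p-value is small, because the procedure stopped at effectiveness. That is the price paid for testing every hypothesis at the full level $\alpha$ without further correction.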

A.5 Power analysis for main inferential goals
The study is successful when there is a positive outcome in all endpoints: AI influence, AI effect on accuracy, effectiveness, and safety. In the following power analysis, we will focus on the safety endpoint. Given the fewer observations available, the safety endpoint is the least powered analysis of all endpoints. Thus, it provides a lower bound for the power of the other endpoints.
We performed a simulation-based power analysis for the safety endpoint based on the logistic regression model described in Section A.4.2. We considered (i) a prospective power calculation based on parameter estimates derived from a pilot study and (ii) a retrospective power calculation based on parameter estimates derived from the current study.

Table A.1. Parameter values used in the simulation-based power analysis for the safety endpoint: number of subjects (endoscopists) $J$, number of lesions relevant to safety $n_S$, safety effect size $\omega_S = e^{\beta}$, intercept term $\mu$ and variances $\sigma^2_\mu$ and $\sigma^2_\alpha$ of the lesion and the subject random effects.
The parameters required for the power analysis are the number of subjects (endoscopists) $J$, the number of lesions relevant to safety $n_S$ (i.e., the number of AI incorrect diagnoses), the safety effect size $\omega_S = e^{\beta}$, the intercept term $\mu$ and the variances $\sigma^2_\mu$ and $\sigma^2_\alpha$ of the lesion and the subject random effects. Table A.1 shows the parameter values estimated from the pilot study (prospective power analysis) and the main study (retrospective power analysis). The pilot study involved $J = 5$ subjects (endoscopists) evaluating $n = 509$ lesions, where the number of lesions relevant to the safety endpoint is $n_S = 147$. However, in the prospective power calculation, we used $n_S = 100$ since the main study would use the next version of the AI engine, featuring a higher AI accuracy. The estimate for $\beta$ from the pilot study is −0.1205 with a standard error of 0.128; we used the $\beta$ estimate minus twice the standard error for the safety effect size. The number of observations given by the subject-lesion combinations is 16 × 100 = 1600 for the prospective analysis and 21 × 100 = 2100 for the retrospective analysis. The parameter specification for the retrospective analysis gives a probability of correct diagnosis in the subset of lesions where AI is wrong equal to 78.8% in S1 and 66.8% in S2, on average over lesions and subjects. The parameter specification for the prospective analysis gives a probability of correct diagnosis in the subset of lesions where AI is wrong equal to 68.8% in S1 and 60.2% in S2, on average over lesions and subjects. The safety effect size estimate of 0.686 based on the pilot study is higher than the estimate of 0.542 based on the main study, and this may be partly due to the lower probability of a correct diagnosis in the subset of lesions where AI is wrong in the pilot study.
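The arithmetic behind these figures can be checked with a short sketch. It assumes that "minus twice the standard error" is applied on the log-odds scale before exponentiating, which reproduces the reported 0.686. The odds ratios computed from the averaged probabilities are marginal quantities, so they need not coincide exactly with the conditional effect sizes $e^{\beta}$ of the mixed model. Variable names are illustrative.

```python
import math

def odds(p):
    """odd(pi) = pi / (1 - pi), as defined in Section A.4.1."""
    return p / (1.0 - p)

# Conservative prospective effect size: exp(beta_hat - 2 * SE).
beta_hat, se = -0.1205, 0.128
omega_prospective = math.exp(beta_hat - 2 * se)

# Subject-lesion combinations used in the two power calculations.
n_obs_prospective = 16 * 100
n_obs_retrospective = 21 * 100

# Odds ratios implied by the average probabilities of a correct diagnosis
# (S2 versus S1) in the subset of lesions where the AI is wrong.
or_prospective = odds(0.602) / odds(0.688)
or_retrospective = odds(0.668) / odds(0.788)

print(round(omega_prospective, 3), round(or_prospective, 3), round(or_retrospective, 3))
```

Both implied odds ratios land close to the effect sizes quoted in the text (0.686 and 0.542), which suggests the reported probabilities and effect sizes are mutually consistent.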
The safety endpoint is achieved when the lower bound of the 95% confidence interval for the safety effect size $\omega_S$ is greater than the safety limit $\Delta_S$. We declared in the preregistration a safety limit $\Delta_S = 0.3$. Fig. A.4 displays the probability of success for the safety endpoint (power) as a function of the safety limit $\Delta_S$. Power is estimated by simulation, using 1000 replications, with the parameter values set as in Tab. A.1. For the declared safety limit $\Delta_S = 0.3$, the estimated power is > 99% in both the prospective and the retrospective analyses. In conclusion, a sample size $J = 16$ warrants adequate power for our main endpoints.
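The data-generating step of such a simulation can be sketched as follows. This is only one replicate and only the simulation half: a full power estimate would refit the mixed logistic model on each of the 1000 replicates and count how often the lower confidence bound exceeds $\Delta_S$, which is not shown here. The model form (overall intercept plus lesion and subject random effects plus a session effect) is an assumption, and the values of `mu`, `sd_lesion` and `sd_subject` are placeholders, not the entries of Table A.1.

```python
import math
import random

def simulate_safety_data(J, n_S, beta, mu, sd_lesion, sd_subject, seed=0):
    """Simulate binary correct-diagnosis indicators for both sessions.

    Assumed model: logit pr(y_ijk = 1) = mu + mu_i + alpha_j + beta * 1(k = 2),
    with Gaussian lesion effects mu_i and subject effects alpha_j.
    Returns a list of (lesion, subject, session, y) tuples.
    """
    rng = random.Random(seed)
    lesion_fx = [rng.gauss(0.0, sd_lesion) for _ in range(n_S)]
    subject_fx = [rng.gauss(0.0, sd_subject) for _ in range(J)]
    rows = []
    for j in range(J):
        for i in range(n_S):
            for k in (1, 2):
                eta = mu + lesion_fx[i] + subject_fx[j] + (beta if k == 2 else 0.0)
                p = 1.0 / (1.0 + math.exp(-eta))
                y = 1 if rng.random() < p else 0
                rows.append((i, j, k, y))
    return rows

# Placeholder parameter values (NOT taken from Table A.1), except for J, n_S
# and the prospective effect size 0.686 quoted in the text.
data = simulate_safety_data(J=16, n_S=100, beta=math.log(0.686),
                            mu=0.8, sd_lesion=1.0, sd_subject=0.5)
```

Each subject-lesion pair contributes one response per session, so a replicate contains 2 × J × n_S rows.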

A.6 Statistical modeling of ancillary effects on endoscopist beliefs
A further refinement of the change in agreement is possible by conditioning on the outcome of the AI, i.e., when AI = $d$ with $d \in \{\text{"Adenoma"}, \text{"Non-Adenoma"}\}$. We define the odds ratio "Change in agreement when AI = $d$" for any $i \in D_j = \{1, \ldots, n : A_{ij} = d\}$ and for $j = 1, \ldots, J$. The odds ratio represents the shift of opinion towards the perceived AI response when AI = $d$. We adopted the logistic model specification described in A.4.2 by conditioning on AI = $d$, for $d \in \{\text{"Adenoma"}, \text{"Non-Adenoma"}\}$. We were also interested in comparing the measure of belief of the $j$th endoscopist about the $i$th lesion in session 1, encoded in the confidence score $S_{ij1}$, with the measure of belief expressed in session 2, encoded in the confidence score $S_{ij2}$. For estimating the average AI effect on the confidence score when AI = $d$ with $d \in \{\text{"Adenoma"}, \text{"Non-Adenoma"}\}$, we defined the "Effect on confidence score when AI = $d$" for any $i \in D_j = \{1, \ldots, n : A_{ij} = d\}$ and for $j = 1, \ldots, J$. Then, we assumed the following linear regression specification, where $\tau_i$ and $\kappa_j$ are nuisance parameters modelled as Gaussian random effects and fixed effects, respectively. The key parameter $\delta$ captures the shift in confidence score between the two sessions when AI = $d$, with $\delta$ greater than (less than) 0 indicating that AI = $d$ is associated with an increased (decreased) confidence towards "Adenoma".

B.1 Endoscopists' processing performance
We asked whether the time required to judge a lesion is impacted by the availability of AI opinions. Endoscopists took, on average, a similar amount of time to decide in S1 and in S2. Although the presence of AI has a minimal effect on decision times, these are modified by other factors, namely confidence and the agreement between endoscopists and AI.
Endoscopists take longer to decide, in both sessions, when they are less confident (Fig. B.5), and they take longer when their opinions disagree with AI (Fig. B.6). These effects on decision times may be understood by considering the relative difficulty of the lesions. On the one hand, when lesion difficulty is taken into account, the effect of confidence on decision times mostly disappears. On the other hand, the "disagreement effect" in S2 almost disappears when it is assessed by discounting the decision times of the same lesion subgroups in S1 (when no disagreement with AI is possible). This suggests that a large part of the "disagreement effect" was due to the differential lesion difficulty in the agreement and disagreement subgroups. Non-experts were faster than experts in both S1 and S2 (Fig. B.7). However, experts' responses slowed down in S2 more than non-experts' responses, meaning that experts spent more time assessing the AI's opinions.
Odds-ratio estimates in italics are significantly different from 1 at level 5% after Hommel's multiplicity correction across the 21 subjects (i.e. adjusted p-value < 0.05).

B.2 Individual variability
This section explores whether the behavioral patterns observed in our global analyses also hold for each endoscopist. Estimated odds ratios $\omega_I$ (influence), $\omega_A$ (accuracy), $\omega_E$ (effectiveness) and $\omega_S$ (safety) for individual endoscopists are reported in Tab. B.2, Fig. B.8 and Fig. B.9. Estimated marginal probabilities (i.e., proportions) of agreement, accuracy, effectiveness, safety and AI perceived accuracy for individual endoscopists in S1 and S2 are reported in Fig. B.10. The great majority of endoscopists increased their accuracy in S2 with respect to S1 (17 out of 21, about 81%). Expert endoscopists were in general more confident than non-experts, both towards themselves and towards AI (Fig. B.11). Overall, the reported individual-level observations encourage the conclusion that our main findings are robust to the effect of individual variability.

Table (columns: Endoscopist, Expertise, S1 accuracy, S2 accuracy, AI accuracy$_u$, AI accuracy$_s$). Human diagnostic accuracy (in S1 and in S2) and AI accuracy of the perceived diagnosis. Following the standard in the field, AI accuracy$_s$ does not count as wrong the lesions where the AI opinion was perceived as "uncertain" or was "not noticed". AI accuracy$_u$ more conservatively reports accuracy by considering "uncertain" and "not noticed" as wrong. The difference between the two AI accuracies is thus due to the varying proportion of lesions classified as "uncertain" and "not noticed" by each endoscopist.

Figure B.11. Confidence distribution in expert and non-expert endoscopists towards themselves (Sessions S1 and S2), and towards AI.

B.3 Effects on endoscopists beliefs
A well-calibrated endoscopist should always consider and integrate informative opinions of the AI. This should be visible in the final endoscopist decisions. However, even when endoscopists do not change their decision, we should observe a belief modification on a more fine-grained scale, such as the confidence scores. Namely:
1. The change in confidence - adenoma. We expected a shift of the endoscopist "confidence score" towards adenoma (i.e., towards score 9) when the AI opinion equals "Adenoma".
2. The change in confidence - non-adenoma. We expected a shift of the endoscopist "confidence score" towards non-adenoma (i.e., towards score 1) when the AI opinion equals "Non-Adenoma".
Results are reported in Tab. B.4. On average, the confidence score increases by 0.72 when the perceived AI response is "Adenoma", and decreases by 0.73 when it is "Non-Adenoma". Fig. B.12 displays the confidence score frequency distribution in the two sessions S1 and S2 as a function of the perceived AI response.

Figure B.12. Influence of the AI: confidence score frequency distribution in the two sessions S1 and S2 as a function of perceived AI response.

B.4 Visual determinants of AI perceived confidence
We explored what the visual determinants of the endoscopist perception of the AI confidence are. We consider two main explanatory variables: 1. Persistence, i.e., the highest number of consecutive frames in which the majority diagnosis (i.e., adenoma or non-adenoma) was shown, divided by the total number of frames with any AI output; 2. Proportion of 'No-prediction' frames in the AI output, i.e., the ratio between frames labeled 'No-prediction' and the number of frames where AI outputs 'Adenoma', 'Non-Adenoma' or 'No-prediction' (thus excluding 'Analyzing').
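The two explanatory variables can be computed from a per-frame label sequence, as in the following sketch. It assumes that "frames with any AI output" means all frames carrying one of the four labels (including 'Analyzing'), and that ties for the majority diagnosis are broken by first occurrence; neither choice is specified in the text, and the function name is hypothetical.

```python
from collections import Counter

def visual_determinants(labels):
    """Return (majority diagnosis, persistence, proportion of 'No-prediction').

    labels: per-frame AI output for one lesion, each one of
    'Analyzing', 'Adenoma', 'Non-Adenoma', 'No-prediction'.
    """
    diagnoses = [x for x in labels if x in ("Adenoma", "Non-Adenoma")]
    if not diagnoses:
        return None, 0.0, None   # the AI never showed a diagnosis
    majority = Counter(diagnoses).most_common(1)[0][0]

    # Longest run of consecutive frames showing the majority diagnosis.
    longest = run = 0
    for x in labels:
        run = run + 1 if x == majority else 0
        longest = max(longest, run)
    persistence = longest / len(labels)   # assumed denominator: all labeled frames

    # Denominator excludes 'Analyzing' frames, as stated in the text.
    predicted = [x for x in labels if x != "Analyzing"]
    no_prediction_prop = predicted.count("No-prediction") / len(predicted)
    return majority, persistence, no_prediction_prop

# Toy sequence of 10 frames, for illustration only.
frames = (["Analyzing"] * 2 + ["Adenoma"] * 3 + ["No-prediction"]
          + ["Adenoma"] * 2 + ["Non-Adenoma"] * 2)
majority, persistence, no_pred = visual_determinants(frames)
```

For this toy sequence the majority diagnosis is "Adenoma", the longest uninterrupted run is 3 of 10 frames, and 1 of the 8 non-'Analyzing' frames is 'No-prediction'.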
The 504 videos, each representing one lesion, are composed of 3285 frames on average (ranging from 575 to 16904). The lesion of interest was detected (i.e., the lesion is visible and the CADe highlights it with a box) on 1444 frames on average (ranging from 48 to 4859), displaying the AI's dynamic output with the labels 'Analyzing', 'Adenoma', 'Non-Adenoma' and 'No-prediction'.

Table B.5. Raw accuracy in Sessions S1 and S2, and AI perceived correct diagnosis, separately by expertise of the endoscopists. Following the standard in the field, AI accuracy$_s$ does not count as wrong the lesions where the AI opinion was perceived as "uncertain" or was "not noticed". We also report the more conservative AI accuracy$_u$, which considers "uncertain" or "not noticed" as errors.

B.5 Detailed descriptive statistics
In this section we report detailed descriptive statistics for raw accuracy (Tab. B.5), specificity (Tab. B.6), sensitivity (Tab. B.7) and our main endpoints (OR for influence, accuracy, effectiveness and safety) separated by histologic classification (Tab. B.9).

Table B.6. Specificity of human diagnosis (S1 and S2) and of AI perceived diagnosis. AI$_s$ does not count as wrong the lesions where the AI opinion was perceived as "uncertain" or was "not noticed". We also report the more conservative AI$_u$, which considers "uncertain" or "not noticed" as errors.

Table B.7. Sensitivity of human diagnosis (S1 and S2) and of AI perceived diagnosis. AI$_s$ does not count as wrong the lesions where the AI opinion was perceived as "uncertain" or was "not noticed". We also report the more conservative AI$_u$, which considers "uncertain" or "not noticed" as errors.

Receiver operating characteristic (ROC) curves have been widely used to evaluate diagnostic performance in multireader, multicase (MRMC) studies, e.g., to compare AI and human experts for breast cancer detection in mammography [4].
In our experiment, multiple endoscopists (i.e., readers) evaluated multiple lesions (i.e., cases). For each lesion, each endoscopist rated his/her level of confidence towards the diagnosis (i.e., the "confidence score" ranging from 1 to 9, where 1 is "Very high confidence for Non-adenoma" and 9 is "Very high confidence for Adenoma"). The ground truth (i.e. "Adenoma" or "Non-adenoma") corresponded to the outcome of the histological evaluation.
The diagnostic performance of each endoscopist is described by a ROC curve showing the true positive proportion (Sensitivity) as a function of the false positive proportion (1 − Specificity), while the decision threshold varies. In the present case, the threshold is the specific confidence score that would be high enough for diagnosing an "adenoma". The ROC curve illustrates the trade-off between the sensitivity and the specificity of an endoscopist across all possible thresholds of the confidence score.
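As a minimal sketch of this construction, the ROC points can be obtained by sweeping the threshold over the nine confidence scores, and the area under the resulting curve by the trapezoidal rule. The function names and the toy data are hypothetical, and the sketch assumes that both classes are present among the cases.

```python
def roc_points(scores, truth):
    """ROC points (FPR, TPR) for confidence scores in 1..9.

    truth: 1 = adenoma, 0 = non-adenoma (histology). A lesion is called
    'adenoma' at threshold t when its score is >= t.
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]
    for t in range(9, 0, -1):
        tp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, truth) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under a sequence of ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data, for illustration only: four adenomas and four non-adenomas.
toy_scores = [9, 8, 6, 3, 7, 2, 1, 5]
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_truth))
```

The trapezoidal area equals the proportion of adenoma/non-adenoma pairs in which the adenoma received the higher score (ties counted as one half), here 13 of 16 pairs.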
The area under an ROC curve (AUC) is a popular summary of the ROC curve, with higher AUC values indicating better performance, irrespective of a chosen threshold. Thus, the comparison in performance between S2 and S1 can be based on the difference between AUC-S2 (AUC of S2) and AUC-S1 (AUC of S1), with the assessment of the statistical significance of the difference taking into account the paired sample design. Figure B.15 displays the ROC curves of all endoscopists, separately for S1 and S2. In 19 out of 21 endoscopists, AUC-S2 is larger than AUC-S1, indicating a better performance in S2 for most of the endoscopists. The sign test of symmetry between S1 and S2 gives a one-sided p-value of 0.0001. To describe the overall performance of the 21 endoscopists, Figure B.16 displays the ROC curve for each session obtained by pooling individual confidence scores across endoscopists [5,6]. In the overall comparison, we see that S2 is superior to S1 since its ROC curve is higher everywhere, with AUC-S2 = 0.847 and AUC-S1 = 0.809. Figure B.17 shows the overall performance separately by expertise of the endoscopists (Expert vs Non-Expert). The AUC for S1 (experts) is 0.853; the AUC for S2 (experts) is 0.869.

Figure B.15. ROC curves of a classification model in which "Histologic evaluation" is the binary output and "Confidence scores" are the predictor, for S1 and S2. The analysis is conducted separately for each endoscopist.

Figure B.16. ROC curves of a classification model in which "Histologic evaluation" is the binary output and "Confidence scores" are the predictor, for S1 and S2. The AUC for Session 1 is 0.809; the AUC for Session 2 is 0.847.

Table B.8. Proportions and sample sizes of the correct human diagnosis in S1, S2 and the AI perceived correct diagnosis, against different human confidence levels and the AI perceived confidence levels, respectively, for experts and non-experts. Accuracy$_s$ does not count as wrong the lesions where the AI opinion was perceived as "uncertain" or was "not noticed". Evaluations of the confidence of the AI were asked only when the opinion was "Adenoma" or "Non-Adenoma".

Table B.9. Odds ratios (OR) for each endpoint, estimated separately for experts and non-experts and by histology (adenoma and non-adenoma). We report in brackets the 95% confidence intervals.

Figure B.17. ROC curves of a classification model in which "Histologic evaluation" is the binary output and "Confidence scores" are the predictor, for S1 and S2. The analysis is conducted separately for experts and non-experts. The AUC for Session 1 (experts) is 0.853; the AUC for Session 2 (experts) is 0.869. The AUC for Session 1 (non-experts) is 0.775; the AUC for Session 2 (non-experts) is 0.831.
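The one-sided sign-test p-value quoted for the comparison of AUC-S2 and AUC-S1 (19 of 21 endoscopists improving) can be reproduced with a short sketch. It assumes that none of the 21 paired differences is exactly zero, so that all 21 endoscopists enter the test; the function name is illustrative.

```python
from math import comb

def sign_test_one_sided(n_positive, n_total):
    """P(X >= n_positive) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total

# 19 of 21 endoscopists had a larger AUC in S2 than in S1.
p_value = sign_test_one_sided(19, 21)
```

The tail sum is (210 + 21 + 1) / 2^21 = 232 / 2097152, roughly 0.00011, in line with the reported value of 0.0001.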